ci: shard tests to run more in parallel by chtruong814 · Pull Request #2345 · NVIDIA-NeMo/RL

chtruong814 · 2026-04-26T16:25:37Z

Summary

Replaces the monolithic L0 unit-test scripts with targeted shards grouped by backend marker and test domain.
- Backend catch-all shards cover mcore, automodel, vllm, sglang, and nemo_gym markers across the unit suite.
- Base domain shards cover models, algorithms, data, distributed, environments, and other unmarked tests.
- Large policy/model/vLLM groups are split with pytest-shard so CI can run them in parallel.
Replaces the monolithic L1 GPU functional script with framework- and algorithm-focused shards for Megatron, AutoModel, SGLang, Gym, GRPO, SFT, Eval, and Other tests.
Updates the GitHub Actions matrices to run the new L0, L1, GB200 L1, and Lfast shard sets in parallel.
Adds a test approval queue so the expanded shard matrix is gated by a concurrency-managed queue instead of allowing too many CICD workflows to run at once.
Adds shared unit-shard setup and makes tests/run_unit.sh treat pytest exit code 5 (no tests collected) as success for shard/FAST safety.
Increases the timeout for some vLLM tests on H100; the cause of the slowdown has not been identified. Also skips the fp8 vLLM tests, which had failures on H100. This is the first time these tests are running on H100. The failures are a different issue from the GB200 failures reported in "vllm generation with fp8 fails on gb200 and h100" (#2081).
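The pytest-shard split mentioned in the summary can be pictured as a deterministic partition of test IDs across shards. The sketch below is an illustration only: it assumes a hash-modulo assignment, and the `assign_shard` and `select_for_shard` helpers and the test IDs are hypothetical. The plugin's actual hashing and the shard counts used in this PR may differ.

```python
import hashlib


def assign_shard(node_id: str, num_shards: int) -> int:
    """Map a test ID to a shard index in [0, num_shards), deterministically."""
    digest = hashlib.sha256(node_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


def select_for_shard(node_ids, shard_id: int, num_shards: int):
    """Return the test IDs that belong to one shard."""
    return [n for n in node_ids if assign_shard(n, num_shards) == shard_id]


# Hypothetical test IDs, for illustration only.
TESTS = [
    "tests/unit/models/policy/test_dtensor_worker.py::test_a",
    "tests/unit/models/policy/test_dtensor_worker.py::test_b",
    "tests/unit/models/policy/test_megatron_worker.py::test_c",
    "tests/unit/models/generation/test_vllm_generation.py::test_d",
]

if __name__ == "__main__":
    for shard_id in range(2):
        print(shard_id, select_for_shard(TESTS, shard_id, 2))
```

Because the assignment depends only on the test ID, every CI job computes the same partition independently, so each test runs in exactly one shard without any coordination between jobs.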

Test approval queue

Adds Approve Test Queue, a scheduled/manual workflow that uses the shared FW-CI test approval queue template for CICD NeMo RL.
Adds a cicd-wait-in-queue gate in the main workflow for PR Lfast/L0/L1/L2 runs before container builds and test jobs proceed.
Concurrency is controlled with repo variables: MAX_CONCURRENCY for internal runs and MAX_CONCURRENCY_EXTERNAL for external runs, both defaulting to 3.
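As a rough illustration of the gate described above, the sketch below reads the two repo variables and falls back to the stated default of 3. The actual queue lives in the shared FW-CI template, so the function names and the way running workflows are counted here are assumptions, not the template's implementation.

```python
import os

DEFAULT_LIMIT = 3  # default for both variables, per the PR description


def concurrency_limit(is_external: bool, env=None) -> int:
    """Pick the limit for internal or external runs, defaulting to 3."""
    env = os.environ if env is None else env
    name = "MAX_CONCURRENCY_EXTERNAL" if is_external else "MAX_CONCURRENCY"
    raw = str(env.get(name, "")).strip()
    return int(raw) if raw else DEFAULT_LIMIT


def may_proceed(running: int, is_external: bool, env=None) -> bool:
    """Assumed rule: a queued run starts only while the limit is not reached."""
    return running < concurrency_limit(is_external, env)
```

For example, with no variables set, a run would be admitted when two workflows are already running but would wait when three are.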

SGLang default

SGLang build and SGLang unit/functional test shards are skipped by default through the SKIP_SGLANG workflow setting.
Set SKIP_SGLANG=false to build SGLang and run the SGLang shards.
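The skip-by-default behaviour can be sketched as a filter over the shard list. This is a hypothetical illustration: the real workflow applies SKIP_SGLANG in the GitHub Actions configuration, and matching shards by name, as done here, is an assumption.

```python
def sglang_skipped(env) -> bool:
    """SGLang is skipped unless SKIP_SGLANG is explicitly set to "false"."""
    return str(env.get("SKIP_SGLANG", "true")).strip().lower() != "false"


def active_shards(shards, env):
    """Drop SGLang shards when SGLang is skipped (name matching is assumed)."""
    if not sglang_skipped(env):
        return list(shards)
    return [s for s in shards if "sglang" not in s.lower()]


SHARDS = ["L0_Unit_Tests_Vllm", "L0_Unit_Tests_Sglang", "L0_Unit_Tests_Mcore"]

if __name__ == "__main__":
    print(active_shards(SHARDS, {}))
    print(active_shards(SHARDS, {"SKIP_SGLANG": "false"}))
```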

Test plan

Verify the L0 unit shard matrix with CI:L0 or higher.
Verify the L1 functional shard matrix with CI:L1.
Verify CI:Lfast mode still applies FAST exclusions correctly.
Verify the test approval queue gates PR CICD runs and respects the configured concurrency limits.
Verify coverage artifacts upload and combine correctly across the new shard names.

Restructure unit test CI from 3 monolithic shards (Generation, Policy, Other) into 9 targeted shards split by extra/marker. Each extra-specific shard (mcore, automodel, vllm, sglang, nemo_gym) runs a single --*-only flag across all unit tests, while domain shards (models, environments, algorithms, other) run only base (unmarked) tests. This eliminates the 5-6 sequential pytest invocations per shard, reduces the bottleneck from 90 min (Policy) to ~30 min per shard, and makes it clear where new tests should be added.

New shards:
- L0_Unit_Tests_Vllm: base vllm generation + --vllm-only catch-all
- L0_Unit_Tests_Sglang: base sglang files + --sglang-only catch-all
- L0_Unit_Tests_Mcore: --mcore-only catch-all
- L0_Unit_Tests_Automodel: --automodel-only catch-all
- L0_Unit_Tests_Nemo_Gym: --nemo-gym-only catch-all
- L0_Unit_Tests_Models: base model tests (minus generation)
- L0_Unit_Tests_Environments: base environment tests
- L0_Unit_Tests_Algorithms: base algorithm tests
- L0_Unit_Tests_Other: catch-all for remaining base tests + research

Also fixes run_unit.sh to treat pytest exit code 5 (no tests collected) as success, preventing shard failures when FAST exclusions remove all tests from a shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
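The exit-code handling described in this commit message can be sketched as follows. The change itself is in tests/run_unit.sh, a shell script; this Python version only mirrors the idea, and the `normalize_exit_code` and `run_pytest` helpers are hypothetical names.

```python
import subprocess
import sys

NO_TESTS_COLLECTED = 5  # pytest exits with 5 when it collects no tests


def normalize_exit_code(code: int) -> int:
    """Treat "no tests collected" as success; pass every other code through."""
    return 0 if code == NO_TESTS_COLLECTED else code


def run_pytest(args) -> int:
    """Run pytest in a subprocess and return the normalized exit code."""
    result = subprocess.run([sys.executable, "-m", "pytest", *args])
    return normalize_exit_code(result.returncode)
```

Only code 5 is remapped, so real test failures (exit code 1) and usage or internal errors still fail the shard.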

copy-pr-bot · 2026-04-26T16:25:40Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

chtruong814 · 2026-04-26T16:31:09Z

/ok to test

The truncated field depends on exact generation output from the tiny model, which is not reproducible across runs. Instead of comparing exact bool values, verify that each value is a bool type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>

The Mcore shard (50 min) and Automodel shard (38 min) are bottlenecked by heavy policy worker tests (test_megatron_worker.py and test_dtensor_worker*.py). Split each into two shards:
- L0_Unit_Tests_Mcore: mcore tests excluding unit/models/policy/ (~15 min)
- L0_Unit_Tests_Mcore_Policy: mcore tests from unit/models/policy/ only (~30 min)
- L0_Unit_Tests_Automodel: automodel tests excluding unit/models/policy/ (~10 min)
- L0_Unit_Tests_Automodel_Policy: automodel tests from unit/models/policy/ only (~28 min)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
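The policy/non-policy split can be pictured as a partition of test files by directory. This is a minimal sketch with a hypothetical `split_policy` helper and example paths chosen for illustration; the shard scripts presumably express the same split through pytest path arguments rather than code like this.

```python
POLICY_DIR = "unit/models/policy/"


def split_policy(paths):
    """Partition test paths into (policy, everything else)."""
    policy = [p for p in paths if POLICY_DIR in p]
    rest = [p for p in paths if POLICY_DIR not in p]
    return policy, rest


# Example paths, for illustration only.
PATHS = [
    "tests/unit/models/policy/test_megatron_worker.py",
    "tests/unit/models/policy/test_dtensor_worker.py",
    "tests/unit/models/generation/test_vllm_generation.py",
    "tests/unit/experience/test_rollouts.py",
]

if __name__ == "__main__":
    policy, rest = split_policy(PATHS)
    print("policy shard:", policy)
    print("base shard:", rest)
```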

Split L0_Unit_Tests_Other into three shards:
- L0_Unit_Tests_Data: data pipeline tests (datasets, processing, message utils)
- L0_Unit_Tests_Distributed: distributed infra tests (worker groups, virtual cluster, logprob)
- L0_Unit_Tests_Other: catch-all for remaining (experience, utils, tools, evals, rewards, root tests)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-04-27T01:09:51Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-04-27T01:14:30Z

/ok to test

The qwen2 parametrizations in test_megatron_policy_training, test_megatron_policy_logprobs, and test_megatron_policy_topk_logits are redundant: the assertions are model-agnostic (no NaN/Inf, correct shapes, loss decreases) and the Qwen->Megatron converter path is thoroughly covered by functional tests (grpo_megatron.sh, dpo_megatron.sh, sft_megatron.sh all use Qwen models).

Removes 14 test instances:
- training: 9 → 7 (dropped 2 qwen2 variants)
- logprobs: 12 → 6 (dropped 6 qwen2 variants)
- topk: 12 → 6 (dropped 6 qwen2 variants)

Estimated savings: ~5-10 minutes on the Mcore_Policy shard.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

…re combos

The training_setup fixture tested 5 model architectures (llama, qwen2, qwen3, gemma3, nemotron5_h) but the assertions are model-agnostic (no NaN/Inf, loss decreases, flops tracking). Model compatibility is covered by functional tests (grpo.sh, grpo_fsdp2.sh, dpo.sh, sft.sh use Qwen and Gemma models).

Consolidate to llama-only while preserving all feature combinations (sp, cpu_offload, activation_checkpointing, cp, and their combos). Reduces from 23 → 10 parametrized test instances.

Logprob_setup left unchanged since it validates numerical correctness via torch.allclose per architecture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Guard the truncated field check with a key existence check since the expected_result dict no longer contains the truncated field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>

The truncated field was incorrectly removed from expected_result in an earlier commit. It should remain present so _standardize can validate the field contains bools before popping it from both sides. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>
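The truncated-field handling described in these commits can be sketched as below. The real `_standardize` helper in test_rollouts.py is not shown in this page, so the function name, signature, and dict layout here are assumptions made for illustration.

```python
def standardize_sketch(result: dict, expected: dict):
    """Check that `truncated` holds only bools on both sides, then drop it.

    The exact bool values depend on non-reproducible generation output, so
    only their type is validated before the remaining fields are compared.
    """
    result, expected = dict(result), dict(expected)  # do not mutate inputs
    for side in (result, expected):
        values = side.pop("truncated")
        assert all(isinstance(v, bool) for v in values), values
    return result, expected


if __name__ == "__main__":
    got = {"truncated": [True, False], "text": ["a", "b"]}
    want = {"truncated": [False, False], "text": ["a", "b"]}
    left, right = standardize_sketch(got, want)
    assert left == right  # equal once the run-dependent field is removed
```

Keeping the field in the expected dict, rather than removing it, lets the same helper validate and strip both sides symmetrically.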

chtruong814 · 2026-04-27T13:04:51Z

/ok to test

Refactor test_megatron_worker.py to use a class-scoped Ray cluster fixture (TestMegatronTwoGPU) for the parametrized tests, following the same pattern as test_dtensor_worker.py's TestTwoGPUCluster.

Previously, each parametrized test (training×7, generation×2, logprobs×6, topk×6 = 21 tests) created and destroyed its own RayVirtualCluster. Now they share a single class-scoped cluster, saving ~20 cluster creation/teardown cycles. Each test still creates and destroys its own Policy for isolation. Standalone tests (checkpoint, loss_independent, grad_norm, etc.) remain outside the class since they need custom cluster configs.

Estimated savings: ~5-10 minutes from avoided cluster overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>

…ests" This reverts commit 1ffeb76. Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-04-27T13:16:16Z

/ok to test

chtruong814 · 2026-05-22T19:44:05Z

/ok to test 3a83519

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-23T01:35:59Z

/ok to test 0863f96

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-23T12:39:58Z

/ok to test 766d6f3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-26T19:01:36Z

/ok to test

chtruong814 · 2026-05-27T00:08:47Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

…ong/shard-tests Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

…ong/shard-tests Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-29T00:25:31Z

/ok to test

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-05-29T04:08:54Z

/ok to test

chtruong814 requested a review from a team as a code owner April 26, 2026 16:25

github-actions Bot added the CI (Relating to CI) label Apr 26, 2026

chtruong814 added the CI:L1 (Run doctests, unit tests, and functional tests) and CI:L0 (Run doctests and unit tests) labels and removed the CI:L1 (Run doctests, unit tests, and functional tests) label Apr 26, 2026

copy-pr-bot Bot temporarily deployed to nemo-ci April 26, 2026 16:31 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 26, 2026 16:38 Inactive

chtruong814 and others added 3 commits April 26, 2026 19:58

copy-pr-bot Bot had a problem deploying to nemo-ci April 27, 2026 01:10 Error

Fix lint error in test_rollouts.py

7cc65b2

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci April 27, 2026 01:15 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci April 27, 2026 01:18 Inactive

chtruong814 and others added 4 commits April 27, 2026 07:56

Fix lint error in test_rollouts.py

de4e5c7

Guard the truncated field check with a key existence check since the expected_result dict no longer contains the truncated field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot Bot had a problem deploying to nemo-ci April 27, 2026 13:05 Error

chtruong814 and others added 2 commits April 27, 2026 08:12

Revert "perf: share Ray cluster across parametrized megatron policy t…

23e250f

…ests" This reverts commit 1ffeb76. Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot Bot had a problem deploying to nemo-ci April 27, 2026 13:17 Error

Merge branch 'main' into chtruong/shard-tests

9ce6119

kajalj22 previously approved these changes May 22, 2026

Comment thread tests/unit/test_recipes_and_test_suites.py

chtruong814 added 2 commits May 22, 2026 20:30

Merge remote-tracking branch 'origin/main' into chtruong/shard-tests

74e6e8b

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: check functional scripts in workflow

0863f96

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

test: make dtensor flops check deterministic

766d6f3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added 2 commits May 26, 2026 13:16

Merge remote-tracking branch 'origin' into chtruong/shard-tests

5e8899a

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

test: collect coverage for other functional tests

0ea3ed4

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 commented May 26, 2026

Comment thread tests/unit/models/policy/test_dtensor_worker.py

chtruong814 commented May 26, 2026

Comment thread tests/unit/models/generation/test_vllm_generation.py

chtruong814 commented May 26, 2026

Comment thread tests/unit/experience/test_rollouts.py

Merge branch 'main' into chtruong/shard-tests

4346d57

kajalj22 previously approved these changes May 27, 2026

terrykong mentioned this pull request May 27, 2026

Unit test shard selection #868

Closed

terrykong reviewed May 28, 2026

Comment thread .github/workflows/cicd-main.yml

Comment thread tests/unit/models/generation/test_vllm_generation.py

Comment thread tests/run_unit.sh

Comment thread tests/unit/models/generation/test_vllm_generation.py

chtruong814 added 7 commits May 28, 2026 18:58

Merge remote-tracking branch 'origin/main' into chtruong/shard-tests

d1867c9

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

ci: address test shard review feedback

f1b5e86

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Merge remote-tracking branch 'origin/chtruong/shard-tests' into chtru…

1bb8976

…ong/shard-tests Signed-off-by: Charlie Truong <chtruong@nvidia.com>

test: rename fp8 vllm skip helper

f711711

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Merge branch 'main' into chtruong/shard-tests

8ed1f2a

Merge remote-tracking branch 'origin' into chtruong/shard-tests

6ee68c9

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Merge remote-tracking branch 'origin/chtruong/shard-tests' into chtru…

25fb08a

…ong/shard-tests Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Merge remote-tracking branch 'origin/main' into chtruong/shard-tests

47c71b7

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

terrykong approved these changes May 29, 2026

achartier mentioned this pull request May 29, 2026

ci: guard coverage combine against empty coverage glob in fast shards #2630

Merged

3 participants
